LLM Spatial Reasoning Performance Analysis
This notebook analyzes the performance of several language models on a spatial reasoning benchmark. It consolidates the results from per-model `analysis_summary.csv` files, one for each complexity level (low, medium, high), and then drills into a single model's three result files.
The analysis includes:
- Overall and per-level accuracy.
- Accuracy broken down by the ground truth direction.
- A confusion matrix to identify specific error patterns.
This methodology is inspired by the evaluation techniques described in the paper by Cohn and Blackwell (2024).
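Before loading the real results, here is a minimal sketch (on toy, hypothetical rows, not the benchmark data) of how the metrics above reduce to pandas operations: overall accuracy is the mean of a 0/1 correctness column, per-direction accuracy is a `groupby` over the ground-truth label, and the confusion matrix is a `pd.crosstab`:

```python
import pandas as pd

# Toy results: hypothetical expected/predicted labels, not the real benchmark
df = pd.DataFrame({
    'expected_answer':  ['Left', 'Left', 'Right', 'Behind'],
    'predicted_answer': ['Left', 'Right', 'Right', 'Behind'],
})
df['is_correct'] = (df['expected_answer'] == df['predicted_answer']).astype(int)

overall_accuracy = df['is_correct'].mean()  # 3 of 4 correct -> 0.75
per_direction = df.groupby('expected_answer')['is_correct'].mean()
confusion = pd.crosstab(df['expected_answer'], df['predicted_answer'],
                        rownames=['Actual'], colnames=['Predicted'])
```

The same three operations, applied to the combined results DataFrame, produce every table and heatmap in this notebook.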
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import numpy as np
import os
# Set plotting style for better aesthetics
sns.set_theme(style="whitegrid")
plt.rcParams['figure.figsize'] = (12, 7)
print("Libraries imported and plotting style set.")
Libraries imported and plotting style set.
# --- CONFIGURATION: Set the path to your main results folder ---
root_results_dir = 'results'
# --- END CONFIGURATION ---

all_dataframes = []
print(f"Scanning for 'analysis_summary.csv' files in: '{root_results_dir}'...")

# Use os.walk() to traverse all subdirectories
for subdir, dirs, files in os.walk(root_results_dir):
    if 'analysis_summary.csv' in files:
        file_path = os.path.join(subdir, 'analysis_summary.csv')
        print(f"  Loading file: {file_path}")
        try:
            df = pd.read_csv(file_path)
            # Extract model and level from the file path.
            # This assumes a path structure like 'results/model_name/level'.
            path_parts = os.path.normpath(subdir).split(os.sep)
            if len(path_parts) >= 3:
                df['model'] = path_parts[-2]
                df['level'] = path_parts[-1].title()  # Capitalize 'low' to 'Low', etc.
                all_dataframes.append(df)
        except Exception as e:
            print(f"  Warning: Could not read or process file {file_path}. Error: {e}")

# Combine all loaded data into a single master DataFrame
if all_dataframes:
    master_df = pd.concat(all_dataframes, ignore_index=True)
    # Clean and standardize the answer labels
    for col in ['expected_answer', 'predicted_answer']:
        if col in master_df.columns:
            master_df[col] = (master_df[col].astype(str).str.strip()
                              .str.title().str.replace(' In ', ' In-'))
    print(f"\nSuccessfully loaded and combined {len(all_dataframes)} result file(s).")
    print(f"Total prompts analyzed: {len(master_df)}")
    print("\nData Head:")
    display(master_df.head())
else:
    print("\nNo 'analysis_summary.csv' files were found. Please check the directory path and structure.")
    master_df = pd.DataFrame()  # Empty DataFrame to avoid errors in later cells
Scanning for 'analysis_summary.csv' files in: 'results'...
  Loading file: results/o4-mini/high/analysis_summary.csv
  Loading file: results/o4-mini/low/analysis_summary.csv
  Loading file: results/o4-mini/medium/analysis_summary.csv
  Loading file: results/gpt-4.1/high/analysis_summary.csv
  Loading file: results/gpt-4.1/low/analysis_summary.csv
  Loading file: results/gpt-4.1/medium/analysis_summary.csv
  Loading file: results/gpt-4.1-mini/high/analysis_summary.csv
  Loading file: results/gpt-4.1-mini/low/analysis_summary.csv
  Loading file: results/gpt-4.1-mini/medium/analysis_summary.csv
  Loading file: results/deepSeek-v3/high/analysis_summary.csv
  Loading file: results/deepSeek-v3/low/analysis_summary.csv
  Loading file: results/deepSeek-v3/medium/analysis_summary.csv
  Loading file: results/gemini-2.5-flash/high/analysis_summary.csv
  Loading file: results/gemini-2.5-flash/low/analysis_summary.csv
  Loading file: results/gemini-2.5-flash/medium/analysis_summary.csv
  Loading file: results/gemini-2.5-flash-lite-preview-06-17/high/analysis_summary.csv
  Loading file: results/gemini-2.5-flash-lite-preview-06-17/low/analysis_summary.csv
  Loading file: results/gemini-2.5-flash-lite-preview-06-17/medium/analysis_summary.csv

Successfully loaded and combined 18 result file(s).
Total prompts analyzed: 8840

Data Head:
|   | prompt_id | model | expected_answer | predicted_answer | is_correct | complexity_level | time_taken | grid_size | level |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | o4-mini | Behind-Left | Behind | 0 | high | 21.52 | 10 | High |
| 1 | 2 | o4-mini | Behind-Left | Behind | 0 | high | 25.20 | 10 | High |
| 2 | 3 | o4-mini | Behind-Left | Behind | 0 | high | 21.82 | 10 | High |
| 3 | 4 | o4-mini | Behind-Left | Behind | 0 | high | 14.57 | 10 | High |
| 4 | 5 | o4-mini | Behind-Left | Behind | 0 | high | 26.95 | 10 | High |
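The model/level extraction in the loading cell relies on the directory layout `results/<model>/<level>/analysis_summary.csv`. A quick sketch (using a hypothetical path of that shape) of what the `normpath` + `split(os.sep)` step recovers:

```python
import os

# Example path matching the layout the loader assumes
subdir = os.path.join('results', 'o4-mini', 'high')
parts = os.path.normpath(subdir).split(os.sep)  # ['results', 'o4-mini', 'high']
model_part = parts[-2]          # second-to-last component is the model name
level_part = parts[-1].title()  # last component is the level, capitalized
print(model_part, level_part)   # o4-mini High
```

If the results folder is nested more deeply (or less), `parts[-2]`/`parts[-1]` would pick up the wrong components, which is why the loader guards on `len(path_parts) >= 3`.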
if not master_df.empty:
    # Ensure the 'is_correct' column is numeric
    master_df['is_correct'] = pd.to_numeric(master_df['is_correct'], errors='coerce')

    print("--- CONSOLIDATED ANALYSIS REPORT ---")

    # Overall accuracy across all models and levels
    overall_accuracy = master_df['is_correct'].mean()
    print(f"\nOverall Accuracy (All Models & Levels): {overall_accuracy:.2%}")

    # Accuracy by model
    print("\nAccuracy by Model:")
    accuracy_by_model = master_df.groupby('model')['is_correct'].mean().sort_values(ascending=False)
    print(accuracy_by_model.to_string(float_format="{:.2%}".format))

    # Accuracy by complexity level
    print("\nAccuracy by Complexity Level:")
    level_order = ['Low', 'Medium', 'High']
    master_df['level'] = pd.Categorical(master_df['level'], categories=level_order, ordered=True)
    # observed=False keeps all categories and silences the pandas FutureWarning
    accuracy_by_level = master_df.groupby('level', observed=False)['is_correct'].mean().reindex(level_order)
    print(accuracy_by_level.to_string(float_format="{:.2%}".format))
else:
    print("No data available for analysis.")
--- CONSOLIDATED ANALYSIS REPORT ---

Overall Accuracy (All Models & Levels): 25.15%

Accuracy by Model:
model
o4-mini                                66.35%
deepSeek-v3                            15.45%
gpt-4.1-mini                           14.90%
gpt-4.1                                11.75%
gemini-2.5-flash-lite-preview-06-17     6.75%
gemini-2.5-flash                        0.00%

Accuracy by Complexity Level:
level
Low       30.74%
Medium    22.23%
High      19.80%
if not master_df.empty:
    # --- Consolidated Accuracy by Model Bar Chart ---
    plt.figure(figsize=(12, 7))
    # Assign hue and disable the legend to avoid the seaborn palette deprecation warning
    ax_model = sns.barplot(x=accuracy_by_model.index, y=accuracy_by_model.values,
                           hue=accuracy_by_model.index, palette="plasma", legend=False)
    ax_model.set_title('Consolidated Accuracy by Model', fontsize=16)
    ax_model.set_ylabel('Accuracy')
    ax_model.set_xlabel('Model')
    ax_model.set_ylim(0, max(1.0, accuracy_by_model.max() * 1.1))
    plt.xticks(rotation=45, ha="right")
    for p in ax_model.patches:
        ax_model.annotate(f"{p.get_height():.2%}", (p.get_x() + p.get_width() / 2., p.get_height()),
                          ha='center', va='center', xytext=(0, 9), textcoords='offset points')
    plt.tight_layout()
    plt.savefig(os.path.join(root_results_dir, 'consolidated_accuracy_by_model.png'))
    plt.show()

    # --- Consolidated Confusion Matrix (Error Rows/Columns Removed) ---
    plt.figure(figsize=(14, 12))
    sns.set_context('talk')
    valid_answers = [ans for ans in sorted(set(master_df['expected_answer'].astype(str))
                                           | set(master_df['predicted_answer'].astype(str)))
                     if ans != 'Error']
    df_filtered = master_df[master_df['expected_answer'].isin(valid_answers)
                            & master_df['predicted_answer'].isin(valid_answers)]
    if not df_filtered.empty:
        confusion_matrix = pd.crosstab(
            pd.Categorical(df_filtered['expected_answer'], categories=valid_answers, ordered=True),
            pd.Categorical(df_filtered['predicted_answer'], categories=valid_answers, ordered=True),
            rownames=['Actual Answer'], colnames=['Predicted Answer'], dropna=False
        )
        sns.heatmap(confusion_matrix, annot=True, fmt='d', cmap='YlGnBu', cbar=False,
                    linewidths=.5, annot_kws={"size": 14})
        plt.title('Consolidated Confusion Matrix (All Models & Levels)', fontsize=20)
        plt.tight_layout()
        plt.savefig(os.path.join(root_results_dir, 'consolidated_confusion_matrix.png'))
        plt.show()
    else:
        print("No valid data to plot confusion matrix.")
    sns.set_context('notebook')  # Reset context
else:
    print("No data available for visualization.")
# --- Consolidated Bar Graph by Complexity Level ---
print("Generating consolidated bar graph comparing models at each complexity level...")

if not master_df.empty:
    # Unique complexity levels in the data, in the correct order
    levels = sorted(master_df['level'].unique(), key=lambda x: ['Low', 'Medium', 'High'].index(x))

    # Create a figure with one subplot per level (e.g., 1 row, 3 columns)
    fig, axes = plt.subplots(1, len(levels), figsize=(24, 8), sharey=True)
    # With a single level, plt.subplots returns a bare Axes, so wrap it in a list
    if len(levels) == 1:
        axes = [axes]
    fig.suptitle('Comparative Model Accuracy by Complexity Level', fontsize=20, y=1.03)

    # Loop through each complexity level and its corresponding subplot axis
    for i, level in enumerate(levels):
        ax = axes[i]
        # Filter the main DataFrame for the current level
        df_level = master_df[master_df['level'] == level]
        if not df_level.empty:
            # Calculate accuracy for each model at this level
            accuracy_by_model_at_level = df_level.groupby('model')['is_correct'].mean().sort_values(ascending=False)
            sns.barplot(ax=ax, x=accuracy_by_model_at_level.index, y=accuracy_by_model_at_level.values,
                        hue=accuracy_by_model_at_level.index, palette='magma', legend=False)
            # Add percentage labels to each bar
            for p in ax.patches:
                ax.annotate(f"{p.get_height():.1%}",
                            (p.get_x() + p.get_width() / 2., p.get_height()),
                            ha='center', va='center', xytext=(0, 9),
                            textcoords='offset points', fontsize=12)
            ax.set_title(f'Level: {level}', fontsize=22)
            ax.set_xlabel('Model', fontsize=20)
            ax.tick_params(axis='x', rotation=90, labelsize=20)
        else:
            ax.set_title(f'Level: {level} (No Data)', fontsize=16)

    # Set the shared y-axis label only on the first plot
    axes[0].set_ylabel('Accuracy', fontsize=22)
    axes[0].set_ylim(0, 1.05)
    plt.tight_layout(rect=[0, 0, 1, 0.96])
    plt.savefig(os.path.join(root_results_dir, 'consolidated_accuracy_by_level_comparison.png'))
    plt.show()
else:
    print("No data available for consolidated bar graph.")
Generating consolidated bar graph comparing models at each complexity level...
if not master_df.empty:
    # Unique models found in the data
    unique_models = master_df['model'].unique()
    print(f"\n{'='*30}\nPER-MODEL DRILL-DOWN ANALYSIS\n{'='*30}")

    for model_name in unique_models:
        print(f"\n--- Analysis for: {model_name} ---")
        # Filter the DataFrame for the current model
        df_model = master_df[master_df['model'] == model_name]

        # --- Accuracy by Direction chart for this model ---
        plt.figure(figsize=(14, 8))
        direction_order = ['In-Front', 'In-Front-Right', 'Right', 'Behind-Right', 'Behind',
                           'Behind-Left', 'Left', 'In-Front-Left',
                           'incorrect prompt', 'unparseable', 'error']
        accuracy_by_direction = df_model.groupby('expected_answer')['is_correct'].mean().reindex(direction_order).dropna()
        ax_dir = sns.barplot(x=accuracy_by_direction.index, y=accuracy_by_direction.values,
                             hue=accuracy_by_direction.index, palette="coolwarm", legend=False)
        ax_dir.set_title(f'Accuracy by Direction for {model_name}', fontsize=16)
        ax_dir.set_xlabel('Ground Truth (Expected Answer)', fontsize=12)
        ax_dir.set_ylabel('Accuracy', fontsize=12)
        ax_dir.set_ylim(0, 1.05)
        plt.xticks(rotation=45, ha='right')
        for p in ax_dir.patches:
            ax_dir.annotate(f"{p.get_height():.1%}", (p.get_x() + p.get_width() / 2., p.get_height()),
                            ha='center', va='center', xytext=(0, 9), textcoords='offset points')
        plt.tight_layout()
        plt.savefig(os.path.join(root_results_dir, f'{model_name}_accuracy_by_direction.png'))
        plt.show()

        # --- Confusion Matrix for this model ---
        plt.figure(figsize=(14, 12))
        all_possible_answers = sorted(set(df_model['expected_answer']) | set(df_model['predicted_answer']))
        cm_model = pd.crosstab(
            pd.Categorical(df_model['expected_answer'], categories=all_possible_answers, ordered=True),
            pd.Categorical(df_model['predicted_answer'], categories=all_possible_answers, ordered=True),
            rownames=['Actual Answer'], colnames=['Predicted Answer'], dropna=False
        )
        sns.heatmap(cm_model, annot=True, fmt='d', cmap='Blues', cbar=False, linewidths=.5)
        plt.title(f'Confusion Matrix for {model_name}', fontsize=16)
        plt.tight_layout()
        plt.savefig(os.path.join(root_results_dir, f'{model_name}_confusion_matrix.png'))
        plt.show()
else:
    print("No data available for per-model analysis.")
==============================
PER-MODEL DRILL-DOWN ANALYSIS
==============================

--- Analysis for: o4-mini ---
--- Analysis for: gpt-4.1 ---
--- Analysis for: gpt-4.1-mini ---
--- Analysis for: deepSeek-v3 ---
--- Analysis for: gemini-2.5-flash ---
--- Analysis for: gemini-2.5-flash-lite-preview-06-17 ---
# --- Per-Model, Per-Complexity Drill-Down Analysis (Corrected Labels) ---
print(f"\n{'='*30}\nDETAILED ANALYSIS: ACCURACY PER-MODEL, PER-LEVEL\n{'='*30}")

if not master_df.empty:
    unique_models = master_df['model'].unique()

    # Loop through each model to create a dedicated figure for it
    for model_name in unique_models:
        print(f"\n--- Generating detailed charts for model: {model_name} ---")
        df_model = master_df[master_df['model'] == model_name]
        levels = sorted(df_model['level'].unique(), key=lambda x: ['Low', 'Medium', 'High'].index(x))

        # One subplot per level; sharex is omitted so every subplot keeps its own tick labels
        fig, axes = plt.subplots(len(levels), 1, figsize=(14, 9 * len(levels)), sharey=True)
        if len(levels) == 1:
            axes = [axes]
        fig.suptitle(f'Model Accuracy by Direction and Complexity for\n{model_name}', fontsize=20, y=1.0)

        direction_order = ['In-Front', 'In-Front-Right', 'Right', 'Behind-Right', 'Behind',
                           'Behind-Left', 'Left', 'In-Front-Left',
                           'incorrect prompt', 'unparseable', 'error']

        # Loop through each complexity level for the current model
        for i, level in enumerate(levels):
            ax = axes[i]
            df_level = df_model[df_model['level'] == level]
            if df_level.empty:
                ax.set_title(f'Level: {level} (No Data)', fontsize=16)
                continue
            accuracy_by_direction = df_level.groupby('expected_answer')['is_correct'].mean().reindex(direction_order).dropna()
            if accuracy_by_direction.empty:
                ax.set_title(f'Level: {level} (No Correct Answers to Plot)', fontsize=16)
                continue
            sns.barplot(ax=ax, x=accuracy_by_direction.index, y=accuracy_by_direction.values,
                        hue=accuracy_by_direction.index, palette="viridis_r", legend=False)
            for p in ax.patches:
                ax.annotate(f"{p.get_height():.1%}", (p.get_x() + p.get_width() / 2., p.get_height()),
                            ha='center', va='center', xytext=(0, 9), textcoords='offset points', fontsize=12)
            ax.set_title(f'Level: {level}', fontsize=16)
            ax.set_ylabel('Accuracy', fontsize=14)
            ax.set_ylim(0, 1.05)
            # Set the x-axis label and rotate the tick labels on EACH subplot
            ax.set_xlabel('Ground Truth (Expected Answer)', fontsize=14)
            ax.tick_params(axis='x', rotation=45, labelsize=12)

        plt.tight_layout(rect=[0, 0.03, 1, 0.97])
        plt.savefig(os.path.join(root_results_dir, f'{model_name}_detailed_accuracy_report.png'))
        plt.show()
else:
    print("No data available for detailed analysis.")
==============================
DETAILED ANALYSIS: ACCURACY PER-MODEL, PER-LEVEL
==============================

--- Generating detailed charts for model: o4-mini ---
--- Generating detailed charts for model: gpt-4.1 ---
--- Generating detailed charts for model: gpt-4.1-mini ---
--- Generating detailed charts for model: deepSeek-v3 ---
--- Generating detailed charts for model: gemini-2.5-flash ---
--- Generating detailed charts for model: gemini-2.5-flash-lite-preview-06-17 ---
# --- Per-Model, Per-Complexity Confusion Matrix Drill-Down (Error Rows/Columns Removed) ---
print(f"\n{'='*30}\nDETAILED ANALYSIS: CONFUSION MATRIX PER-MODEL, PER-LEVEL\n{'='*30}")

if not master_df.empty:
    unique_models = master_df['model'].unique()

    # Define the valid answer categories once, outside the loop
    valid_answers = [
        'In-Front', 'In-Front-Right', 'Right', 'Behind-Right',
        'Behind', 'Behind-Left', 'Left', 'In-Front-Left',
        'incorrect prompt'
    ]

    for model_name in unique_models:
        print(f"\n--- Generating confusion matrices for model: {model_name} ---")
        df_model = master_df[master_df['model'] == model_name]
        levels = sorted(df_model['level'].unique(), key=lambda x: ['Low', 'Medium', 'High'].index(x))
        fig, axes = plt.subplots(1, len(levels), figsize=(12 * len(levels), 10), sharey=True)
        if len(levels) == 1:
            axes = [axes]
        fig.suptitle(f'Confusion Matrix by Complexity Level for\n{model_name}', fontsize=30, y=1.0)

        for i, level in enumerate(levels):
            ax = axes[i]
            df_level = df_model[df_model['level'] == level]
            if df_level.empty:
                ax.set_title(f'Level: {level}\n(No Data)', fontsize=18)
                continue
            # 1. Filter this level's data down to the valid answers (drops 'error' values)
            df_level_filtered = df_level[
                df_level['expected_answer'].isin(valid_answers) &
                df_level['predicted_answer'].isin(valid_answers)
            ]
            # 2. Build the crosstab from the filtered data and the list of valid answers
            confusion_matrix_level = pd.crosstab(
                pd.Categorical(df_level_filtered['expected_answer'], categories=valid_answers, ordered=True),
                pd.Categorical(df_level_filtered['predicted_answer'], categories=valid_answers, ordered=True),
                rownames=['Actual Answer'], colnames=['Predicted Answer'], dropna=False
            )
            sns.heatmap(confusion_matrix_level, annot=True, fmt='d', cmap='YlGnBu', cbar=False,
                        ax=ax, linewidths=.5, annot_kws={"size": 14})
            ax.set_title(f'Level: {level}', fontsize=22)
            ax.set_ylabel('')
            ax.set_xlabel('Predicted Answer', fontsize=20)
            ax.tick_params(axis='x', rotation=90, labelsize=20)
            ax.tick_params(axis='y', rotation=0, labelsize=20)

        axes[0].set_ylabel('Actual Answer', fontsize=22)
        plt.tight_layout(rect=[0, 0, 1, 0.95])
        plt.savefig(os.path.join(root_results_dir, f'{model_name}_detailed_confusion_matrix.png'))
        plt.show()
else:
    print("No data available for detailed analysis.")
==============================
DETAILED ANALYSIS: CONFUSION MATRIX PER-MODEL, PER-LEVEL
==============================

--- Generating confusion matrices for model: o4-mini ---
--- Generating confusion matrices for model: gpt-4.1 ---
--- Generating confusion matrices for model: gpt-4.1-mini ---
--- Generating confusion matrices for model: deepSeek-v3 ---
--- Generating confusion matrices for model: gemini-2.5-flash ---
--- Generating confusion matrices for model: gemini-2.5-flash-lite-preview-06-17 ---
# --- Configuration: Set your CSV filenames here ---
file_low_complexity = 'results/o4-mini/low/analysis_summary.csv'
file_medium_complexity = 'results/o4-mini/medium/analysis_summary.csv'
file_high_complexity = 'results/o4-mini/high/analysis_summary.csv'
# --- END CONFIGURATION ---

try:
    # Load each CSV into a pandas DataFrame
    df_low = pd.read_csv(file_low_complexity)
    df_medium = pd.read_csv(file_medium_complexity)
    df_high = pd.read_csv(file_high_complexity)

    # Add a 'complexity_level' column to each DataFrame before combining;
    # this is crucial for per-level analysis later.
    df_low['complexity_level'] = 'Low'
    df_medium['complexity_level'] = 'Medium'
    df_high['complexity_level'] = 'High'

    # Combine all DataFrames into a single master DataFrame
    all_results_df = pd.concat([df_low, df_medium, df_high], ignore_index=True)

    # Display the first few rows and info to verify
    print("Successfully loaded and combined all data.")
    print(f"Total prompts analyzed: {len(all_results_df)}")
    print("\nData Head:")
    display(all_results_df.head())
    print("\nData Info:")
    all_results_df.info()
except FileNotFoundError as e:
    print(f"Error: Could not find a file. Please make sure the filenames are correct. Details: {e}")
Successfully loaded and combined all data.
Total prompts analyzed: 2000

Data Head:
|   | prompt_id | model | expected_answer | predicted_answer | is_correct | complexity_level | time_taken | grid_size |
|---|---|---|---|---|---|---|---|---|
| 0 | 1 | o4-mini | behind-right | behind | 0 | Low | 5.80 | 10 |
| 1 | 2 | o4-mini | behind-right | behind | 0 | Low | 4.81 | 10 |
| 2 | 3 | o4-mini | behind-right | behind | 0 | Low | 6.45 | 10 |
| 3 | 4 | o4-mini | behind-right | behind | 0 | Low | 6.84 | 10 |
| 4 | 5 | o4-mini | behind-right | behind | 0 | Low | 5.22 | 10 |
Data Info:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2000 entries, 0 to 1999
Data columns (total 8 columns):
 #   Column            Non-Null Count  Dtype
---  ------            --------------  -----
 0   prompt_id         2000 non-null   int64
 1   model             2000 non-null   object
 2   expected_answer   2000 non-null   object
 3   predicted_answer  2000 non-null   object
 4   is_correct        2000 non-null   int64
 5   complexity_level  2000 non-null   object
 6   time_taken        2000 non-null   float64
 7   grid_size         2000 non-null   int64
dtypes: float64(1), int64(3), object(4)
memory usage: 125.1+ KB
all_results_df['expected_answer'].unique()
array(['behind-right', 'in-front-right', 'in-front-left', 'behind-left',
'left', 'right', 'behind', 'in-front'], dtype=object)
# Ensure the 'is_correct' column is numeric for calculations
all_results_df['is_correct'] = pd.to_numeric(all_results_df['is_correct'], errors='coerce')
# Extract the model name from the first entry for reporting
model_name = all_results_df['model'].iloc[0] if not all_results_df.empty else 'Unknown Model'
print(f"--- ANALYSIS REPORT FOR: {model_name} ---")
# Overall Accuracy
overall_accuracy = all_results_df['is_correct'].mean()
print(f"\nOverall Accuracy: {overall_accuracy:.2%}")
# Accuracy by Complexity Level
accuracy_by_level = all_results_df.groupby('complexity_level')['is_correct'].mean().reindex(['Low', 'Medium', 'High'])
print("\nAccuracy by Complexity Level:")
print(accuracy_by_level.to_string(float_format="{:.2%}".format))
--- ANALYSIS REPORT FOR: o4-mini ---

Overall Accuracy: 66.35%

Accuracy by Complexity Level:
complexity_level
Low       74.00%
Medium    58.25%
High      67.25%
Accuracy by Ground Truth Direction
This bar chart shows the model's accuracy for each of the 8 possible correct answers. This helps identify if the model has a bias or weakness for specific directions (e.g., it might be better at 'In-Front' than 'Behind-Left'). This analysis is similar to Figure 2b in the Cohn & Blackwell paper.
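On toy data (hypothetical rows, not the benchmark results), the per-direction breakdown described above is just a `groupby` over the ground-truth column:

```python
import pandas as pd

# Hypothetical mini-results: two directions, mixed correctness
toy = pd.DataFrame({
    'expected_answer': ['in-front', 'in-front', 'behind', 'behind'],
    'is_correct':      [1, 0, 1, 1],
})
# Mean of the 0/1 column per ground-truth label = per-direction accuracy
acc_by_direction = toy.groupby('expected_answer')['is_correct'].mean()
# behind -> 1.0, in-front -> 0.5
```

The cell below applies exactly this to `all_results_df` and plots the result as a bar chart.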
plt.figure(figsize=(14, 8))
# Define a consistent order for directions for a more readable chart.
# The labels are ordered geometrically (clockwise from 'in-front');
# 'incorrect prompt' is included to surface any items flagged during
# generation, and the .dropna() below drops categories absent from the data.
direction_order = ['in-front', 'in-front-right', 'right', 'behind-right',
'behind', 'behind-left', 'left', 'in-front-left', 'incorrect prompt']
# Calculate accuracy for each ground truth direction
accuracy_by_direction = all_results_df.groupby('expected_answer')['is_correct'].mean().reindex(direction_order).dropna()
# Assign `hue` alongside `x` (with legend=False) to avoid seaborn's
# deprecation warning for passing `palette` without `hue`
ax = sns.barplot(x=accuracy_by_direction.index, y=accuracy_by_direction.values, hue=accuracy_by_direction.index, palette="coolwarm", legend=False)
# Add percentage labels on top of each bar
for p in ax.patches:
ax.annotate(f"{p.get_height():.1%}", (p.get_x() + p.get_width() / 2., p.get_height()),
ha='center', va='center', xytext=(0, 9), textcoords='offset points')
ax.set_title(f'Accuracy by Spatial Direction for {model_name}', fontsize=16)
ax.set_xlabel('Spatial Direction', fontsize=12)
ax.set_ylabel('Accuracy', fontsize=12)
ax.set_ylim(0, 1.05)
plt.xticks(rotation=45, ha='right')
plt.tight_layout()
# Save the plot to a file
plt.savefig(f'{model_name}_accuracy_by_direction.png')
plt.show()
# Create a figure with 3 subplots arranged in a single column
# The figsize is adjusted to be taller to accommodate the three plots.
fig, axes = plt.subplots(3, 1, figsize=(14, 24), sharey=True)
fig.suptitle(f'Model Accuracy by Direction for Each Complexity Level ({model_name})', fontsize=20, y=1.02)
# Define the levels to iterate over
levels = ['Low', 'Medium', 'High']
for i, level in enumerate(levels):
# Filter the DataFrame for the current complexity level
df_level = all_results_df[all_results_df['complexity_level'] == level]
ax = axes[i] # Select the subplot for this level
if not df_level.empty:
# Calculate accuracy for each ground truth direction for this level
accuracy_by_direction = df_level.groupby('expected_answer')['is_correct'].mean().reindex(direction_order).dropna()
# Create the bar plot on the specific subplot
# Assign `hue` alongside `x` (with legend=False) to avoid seaborn's
# deprecation warning for passing `palette` without `hue`
sns.barplot(ax=ax, x=accuracy_by_direction.index, y=accuracy_by_direction.values, hue=accuracy_by_direction.index, palette="viridis", legend=False)
# Add percentage labels to each bar
for p in ax.patches:
ax.annotate(f"{p.get_height():.1%}", (p.get_x() + p.get_width() / 2., p.get_height()),
ha='center', va='center', xytext=(0, 9), textcoords='offset points')
ax.set_title(f'Complexity Level: {level}', fontsize=16)
ax.set_ylabel('Accuracy', fontsize=12)
ax.set_ylim(0, 1.05)
ax.tick_params(axis='x', rotation=45)
else:
ax.set_title(f'Complexity Level: {level} (No Data)', fontsize=16)
ax.text(0.5, 0.5, 'No data available for this level.', ha='center', va='center')
# Set the x-axis label only for the bottom plot to avoid repetition
axes[-1].set_xlabel('Spatial Direction', fontsize=12)
# Adjust layout to prevent titles and labels from overlapping
plt.tight_layout(rect=[0, 0, 1, 0.98])
# Save the combined figure to a file
plt.savefig(f'{model_name}_accuracy_by_level_and_direction.png')
plt.show()
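The per-level bar charts above can also be summarized in a single table with `pivot_table`, which makes level-by-level comparisons for a given direction easier to scan. A minimal sketch on a toy frame with the same column names assumed for `all_results_df`:

```python
import pandas as pd

# Toy results frame with the columns assumed throughout this notebook
df = pd.DataFrame({
    "complexity_level": ["Low", "Low", "Medium", "Medium", "High", "High"],
    "expected_answer":  ["left", "left", "left", "right", "right", "right"],
    "is_correct":       [1, 0, 1, 1, 0, 1],
})

# Rows: ground-truth direction, columns: complexity level, values: mean accuracy
acc_table = df.pivot_table(index="expected_answer",
                           columns="complexity_level",
                           values="is_correct",
                           aggfunc="mean")
acc_table = acc_table.reindex(columns=["Low", "Medium", "High"])
print(acc_table.round(2))
```

Cells with no observations (a direction never sampled at a given level) show as NaN, which is itself useful coverage information.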
Confusion Matrix¶
The confusion matrix is one of the most powerful tools for error analysis: it shows exactly what the model predicted when the actual answer was something else. The diagonal holds correct answers, while off-diagonal cells highlight the model's specific mistakes. This analysis is similar to that shown in Figure 1d and Figure 3 of the reference paper.
plt.figure(figsize=(8, 8))
# Define a consistent order for all possible answers to ensure the matrix is square
all_possible_answers = sorted(list(set(all_results_df['expected_answer']) | set(all_results_df['predicted_answer'])))
# Create the confusion matrix using pandas.crosstab
confusion_matrix = pd.crosstab(
pd.Categorical(all_results_df['expected_answer'], categories=all_possible_answers, ordered=True),
pd.Categorical(all_results_df['predicted_answer'], categories=all_possible_answers, ordered=True),
rownames=['Actual Answer (Expected)'],
colnames=['Predicted Answer'],
dropna=False
)
# Visualize the matrix as a heatmap
sns.heatmap(confusion_matrix, annot=True, fmt='d', cmap='Blues', cbar=False, linewidths=.5)
plt.title(f'Confusion Matrix for {model_name}', fontsize=16)
plt.xticks(rotation=45, ha='right')
plt.yticks(rotation=0)
plt.tight_layout()
# Save the plot to a file
plt.savefig(f'{model_name}_confusion_matrix.png')
plt.show()
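Beyond eyeballing the heatmap, the matrix can be ranked programmatically: zero out the diagonal, stack the remaining cells, and sort to get the most frequent specific confusions first. A sketch on toy expected/predicted pairs (the sample data is illustrative):

```python
import pandas as pd

# Toy expected/predicted pairs standing in for the columns of all_results_df
expected  = ["left", "left", "left", "right", "behind", "behind"]
predicted = ["left", "right", "right", "right", "in-front", "behind"]

cm = pd.crosstab(pd.Series(expected, name="actual"),
                 pd.Series(predicted, name="pred"))

# Zero out the diagonal (correct answers), then stack and sort
# the off-diagonal cells to rank the model's specific mistakes
errors = cm.copy()
for label in errors.index.intersection(errors.columns):
    errors.loc[label, label] = 0
top_errors = errors.stack().sort_values(ascending=False)
top_errors = top_errors[top_errors > 0]
print(top_errors.head())
```

On the real data this surfaces, for example, whether opposite-direction swaps (left vs. right, in-front vs. behind) dominate the error mass, the pattern the reference paper's Figure 3 discussion focuses on.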
# --- Visualization: Confusion Matrix by Complexity Level (Large Font) ---
print("Generating Confusion Matrices for each complexity level with larger fonts...")
# Set a larger context for the plot, which increases font sizes and line widths globally.
# 'talk' is a good setting for readability. 'poster' is even larger.
sns.set_context('talk')
# Get the unique levels from the data in a defined order
levels = ['Low', 'Medium', 'High']
# Create a figure with subplots (1 row, 3 columns). Increased figsize for better spacing.
fig, axes = plt.subplots(1, len(levels), figsize=(30, 10), sharey=True)
# Set a single, overarching title for the entire figure with a larger font.
fig.suptitle(f'Confusion Matrix by Complexity Level for {model_name}', fontsize=24, y=1.02)
# Define a consistent order for all possible answers across all subplots
all_possible_answers = sorted(list(set(all_results_df['expected_answer']) | set(all_results_df['predicted_answer'])))
# Loop through each complexity level and its corresponding subplot axis
for i, level in enumerate(levels):
ax = axes[i]
# Filter the main DataFrame for the current level
df_level = all_results_df[all_results_df['complexity_level'] == level]
if df_level.empty:
ax.set_title(f'Level: {level}\n(No Data Available)', fontsize=18)
ax.text(0.5, 0.5, 'No data to plot.', ha='center', va='center')
ax.set_xlabel('')
continue
# Create the confusion matrix for this specific level's data
confusion_matrix_level = pd.crosstab(
pd.Categorical(df_level['expected_answer'], categories=all_possible_answers, ordered=True),
pd.Categorical(df_level['predicted_answer'], categories=all_possible_answers, ordered=True),
rownames=['Actual Answer (Expected)'],
colnames=['Predicted Answer'],
dropna=False
)
# Visualize the heatmap on the specific subplot (ax)
# Added annot_kws to control the font size of the numbers in the heatmap
sns.heatmap(
confusion_matrix_level,
annot=True,
fmt='d',
cmap='viridis',
cbar=False,
ax=ax,
linewidths=.5,
annot_kws={"size": 16} # Increase font size of the numbers
)
ax.set_title(f'Level: {level}', fontsize=22)
# Increase tick label sizes
ax.tick_params(axis='x', rotation=90, labelsize=18)
ax.tick_params(axis='y', rotation=0, labelsize=18)
# Set shared axis labels with a larger font
axes[0].set_ylabel('Actual Answer (Expected)', fontsize=22)
axes[1].set_xlabel('Predicted Answer', fontsize=22) # Set label on middle plot for balance
# Adjust layout and show the plot
plt.tight_layout(rect=[0, 0.05, 1, 0.96]) # Adjust rect to make space for suptitle
plt.savefig(f'{model_name}_confusion_matrix_by_level_large.png', dpi=150) # Save at higher DPI
plt.show()
# Reset context back to default 'notebook' size for any subsequent plots
sns.set_context('notebook')
Generating Confusion Matrices for each complexity level with larger fonts...